Analysis of differential gene expression has been proved useful in many fields and applications. Related to RNA sequencing, differential expression analysis means taking the normalized read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, the use of statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.
In this report we are going to use different methods to analyze the raw RNA-seq data of some uterine cancer samples.
Uterine cancer is a disease that develops in the tissues of the uterus. The uterus or womb is a secondary sex organ of the female reproductive system in most mammals, including humans, where the fetus develops during gestation. Depending on the specific area where a tumor develops, this kind of cancer can be classified in two types: endometrial cancer forms from the lining of the uterus and uterine sarcoma forms from the muscles or support tissue of the uterus.
The samples used in this report belong to the first, more common and less aggressive type mentioned. Although it can usually be cured, the endometrial carcinogenesis is thought to be caused by a multi-step process that involves the interaction of hormonal regulation, gene mutation, adhesion molecules and apoptosis, and as many other complex types of cancer, it’s molecular pathogenesis is not completely described.
The samples used for the analysis come from a cohort of The Cancer Genome Atlas (TCGA), and we are testing the hypothesis that differential expression between tumor and control samples can reveal specific gene sets involved in the mentioned unknown pathogenesis.
As the input data is in raw RNA read counts, the first part of the project will be a summary and information extraction of the dataset.
We start importing the raw table of counts. We can observe 589 patients and 20115 transcripts.
library(SummarizedExperiment)
se <- readRDS(file.path("rawCounts", "seUCEC.rds"))
se ##589 samples // 20115 gene transcripts
class: RangedSummarizedExperiment
dim: 20115 589
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(589): TCGA.2E.A9G8.01A.11R.A40A.07
TCGA.4E.A92E.01A.11R.A37O.07 ... TCGA.FL.A1YV.11A.12R.A32Y.07
TCGA.FL.A3WE.11A.11R.A22K.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
#se$gender
table (se$gender)
FEMALE
556
table(se$type) #35 normal and 554 tumor.
normal tumor
35 554
table(se$gender, se$type)
normal tumor
FEMALE 23 533
sum(is.na(se$gender)) ##number of NA samples in termes of gender
[1] 33
We can observe the information of the gender from our object data. We exepected to have all female samples but we noticed that we have 33 NA samples, that have to be mantained because are normal ones. Since the number of normal samples is smaller than the number of tumor samples it is important to mantain those normal samples in order to perform the comparison. In the last table we can see a Sex x Type Summarize, and we can see that only 23 samples are cataloged as female and normal.
We can observe the phenotypic data information that is related to the patient’s samples. Notice that we are working with a s4 object which is strict, formal and rigurous. We can observe 589 patients and 20115 transcripts. Also there are 549 clinical variables for each of the 589 patients.
dim(colData(se)) #549 clinical variables for each of the 589 patients
[1] 589 549
colData(se)[1:5, 1:5] #example of 5 rows (patients) and 5 columns
DataFrame with 5 rows and 5 columns
type bcr_patient_uuid
<factor> <factor>
TCGA.2E.A9G8.01A.11R.A40A.07 tumor 9583C10B-B21A-4863-98FA-61E735E64EA5
TCGA.4E.A92E.01A.11R.A37O.07 tumor NA
TCGA.5B.A90C.01A.11R.A37O.07 tumor 16AC4341-CF8F-45E2-B90B-2D12D5F74A59
TCGA.5S.A9Q8.01A.11R.A40A.07 tumor 8751429B-4A11-451E-B978-DC9E9DB0EB36
TCGA.A5.A0G1.01A.11R.A118.07 tumor 53707bb3-426a-43cb-830f-3eeed930295f
bcr_patient_barcode form_completion_date
<factor> <factor>
TCGA.2E.A9G8.01A.11R.A40A.07 TCGA-2E-A9G8 2014-5-21
TCGA.4E.A92E.01A.11R.A37O.07 NA NA
TCGA.5B.A90C.01A.11R.A37O.07 TCGA-5B-A90C 2014-4-17
TCGA.5S.A9Q8.01A.11R.A40A.07 TCGA-5S-A9Q8 2013-12-23
TCGA.A5.A0G1.01A.11R.A118.07 TCGA-A5-A0G1 2011-1-7
prospective_collection
<factor>
TCGA.2E.A9G8.01A.11R.A40A.07 NO
TCGA.4E.A92E.01A.11R.A37O.07 NA
TCGA.5B.A90C.01A.11R.A37O.07 NO
TCGA.5S.A9Q8.01A.11R.A40A.07 YES
TCGA.A5.A0G1.01A.11R.A118.07 NO
mcols(colData(se), use.names=TRUE)
DataFrame with 549 rows and 2 columns
labelDescription
<character>
type sample type (tumor/normal)
bcr_patient_uuid bcr patient uuid
bcr_patient_barcode bcr patient barcode
form_completion_date form completion date
prospective_collection tissue prospective collection indicator
... ...
lymph_nodes_pelvic_pos_total total pelv lnp
lymph_nodes_aortic_examined_count total aor lnr
lymph_nodes_aortic_pos_by_he aln pos light micro
lymph_nodes_aortic_pos_by_ihc aln pos ihc
lymph_nodes_aortic_pos_total total aor-lnp
CDEID
<character>
type NA
bcr_patient_uuid NA
bcr_patient_barcode 2673794
form_completion_date NA
prospective_collection 3088492
... ...
lymph_nodes_pelvic_pos_total 3151828
lymph_nodes_aortic_examined_count 3104460
lymph_nodes_aortic_pos_by_he 3151832
lymph_nodes_aortic_pos_by_ihc 3151831
lymph_nodes_aortic_pos_total 3151827
These metadata consists of two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be used in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search. The CDE was originally intended to be a standard nomenclature for the reporting of Phase 3 cancer clinical trials data.
Now, explore the row (feature) data.
rowData(se) #Each of the genes symbol, their lenght and the gc content.
DataFrame with 20115 rows and 3 columns
symbol txlen txgc
<character> <integer> <numeric>
1 A1BG 3322 0.564419024683925
2 A2M 4844 0.488232865400495
9 NAT1 2280 0.394298245614035
10 NAT2 1322 0.389561270801815
12 SERPINA3 3067 0.524942940984676
... ... ... ...
100996331 POTEB 1706 0.430832356389215
101340251 SNORD124 104 0.490384615384615
101340252 SNORD121B 81 0.407407407407407
102724473 GAGE10 538 0.505576208178439
103091865 BRWD1-IT2 1028 0.592412451361868
#Gets the range of values in each row of a matrix.
rowRanges(se) #The chromosome in which the gene is located, the range, the strand, the symbol, the lenght and the gc content.
GRanges object with 20115 ranges and 3 metadata columns:
seqnames ranges strand | symbol txlen
<Rle> <IRanges> <Rle> | <character> <integer>
1 chr19 58345178-58362751 - | A1BG 3322
2 chr12 9067664-9116229 - | A2M 4844
9 chr8 18170477-18223689 + | NAT1 2280
10 chr8 18391245-18401218 + | NAT2 1322
12 chr14 94592058-94624646 + | SERPINA3 3067
... ... ... ... . ... ...
100996331 chr15 20835372-21877298 - | POTEB 1706
101340251 chr17 40027542-40027645 - | SNORD124 104
101340252 chr9 33934296-33934376 - | SNORD121B 81
102724473 chrX 49303669-49319844 + | GAGE10 538
103091865 chr21 39313935-39314962 + | BRWD1-IT2 1028
txgc
<numeric>
1 0.564419024683925
2 0.488232865400495
9 0.394298245614035
10 0.389561270801815
12 0.524942940984676
... ...
100996331 0.430832356389215
101340251 0.490384615384615
101340252 0.407407407407407
102724473 0.505576208178439
103091865 0.592412451361868
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
The genomic ranges of the transcripts are not the ones that are represented in NCBI database using the hg38 version. Example: A1BG: 58346806..58353499 complement instead of 58345178-58362751 A2M: 9067708..9116229 complement instead of 9067664-9116229
However we consider that is not relevant in relation to our study, because the variance is minimal.
To retrieve the experiment data from a SummarizedExperiment object one can use the assays() accessor. An object can have multiple assay datasets each of which can be accessed using the $ operator. The airway dataset contains only one assay (counts). Here, each row represents a gene transcript and each column one of the samples.
We can observe a table in which there is a representation of gene transcripts (rows) of each of the samples (columns). A summary of the counts is printed in order to see the tendency of our data.
assays(se)
List of length 1
names(1): counts
assays(se)$counts[1:5, 1:5] #Each row represents a gene transcript and each column one of the samples.
TCGA.2E.A9G8.01A.11R.A40A.07 TCGA.4E.A92E.01A.11R.A37O.07
1 48 39
2 13605 31176
9 0 0
10 0 0
12 1356 1887
TCGA.5B.A90C.01A.11R.A37O.07 TCGA.5S.A9Q8.01A.11R.A40A.07
1 20 28
2 13250 2876
9 0 0
10 0 0
12 1478 639
TCGA.A5.A0G1.01A.11R.A118.07
1 77
2 7067
9 0
10 0
12 930
countexpr <- assays(se)$counts
dim(countexpr)
[1] 20115 589
#Summary statistics of the sequencing depth that mapped the genes.
summary(colSums(countexpr))
Min. 1st Qu. Median Mean 3rd Qu. Max.
3345074 17330822 20439586 24645417 33657247 60097229
In the previous summary we can see some visualization of our data, including the header of a table where the rows are genes and the colums are samples (so the number would be the reads mapping that gene in each individual) and some main statistics below.
The main goal of this step is to bring the samples to a level that can be comparable removing the technical differences between them.
We have to take into account 2 steps, within-samples and between-samples normalization:
We need first to load the edgeR R/Bioconductor package and create a `DGEList’ object.
library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
saveRDS(dge, "results/dge1.rds")
names(dge)
[1] "counts" "samples" "genes"
head(dge$samples) #Data frame as many rows as samples. Group column (natural grouping tool). Lib size adds up counts per sample
group lib.size norm.factors
TCGA.2E.A9G8.01A.11R.A40A.07 1 34953667 1
TCGA.4E.A92E.01A.11R.A37O.07 1 37504801 1
TCGA.5B.A90C.01A.11R.A37O.07 1 51816834 1
TCGA.5S.A9Q8.01A.11R.A40A.07 1 10067025 1
TCGA.A5.A0G1.01A.11R.A118.07 1 19273702 1
TCGA.A5.A0G2.01A.11R.A16W.07 1 20526431 1
#lib size is read addition in a column
head(colSums(assays(se)$counts)) ##We can confirm it looking at these values
TCGA.2E.A9G8.01A.11R.A40A.07 TCGA.4E.A92E.01A.11R.A37O.07
34953667 37504801
TCGA.5B.A90C.01A.11R.A37O.07 TCGA.5S.A9Q8.01A.11R.A40A.07
51816834 10067025
TCGA.A5.A0G1.01A.11R.A118.07 TCGA.A5.A0G2.01A.11R.A16W.07
19273702 20526431
#norm factors (by deafult is 1)
Now calculate \(\log_2\) CPM values of expression and put them as an additional assay element to ease their manipulation.
assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
#Log of 0 does not exist so we use a prior count value that is added in all the values (log between 0 and 1 is negative).
assays(se)$logCPM[1:5, 1:5]
TCGA.2E.A9G8.01A.11R.A40A.07 TCGA.4E.A92E.01A.11R.A37O.07
1 0.4787484 0.08427463
2 8.6045526 9.69917999
9 -5.6232476 -5.62324755
10 -5.6232476 -5.62324755
12 5.2785238 5.65345688
TCGA.5B.A90C.01A.11R.A37O.07 TCGA.5S.A9Q8.01A.11R.A40A.07
1 -1.299515 1.486274
2 7.998470 8.158385
9 -5.623248 -5.623248
10 -5.623248 -5.623248
12 4.835107 5.988568
TCGA.A5.A0G1.01A.11R.A118.07
1 2.005532
2 8.518400
9 -5.623248
10 -5.623248
12 5.593132
A very first basic QA diagnostic of expression profiles in RNA-seq data is to examine the sequencing depth of mapped reads in increasing order with a bar plot. Labeling/coloring bars for each grup of samples can help to identify problems.
Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample:
[1] 14516150674
[1] 24645417
Figure 1: Library sizes in increasing order
We have colored the bars using the type condition (tumor/normal) because we want to spot differences of expression between these conditions.
If the tumor samples would be in one site and the other ones in the other, it would be a problem, meaning that the control and tumor samples have been differently measured.
As there is a large amount of samples, it is difficult to distinguish between tumor and normal samples in this plot, also because there are much more control samples than tumor ones. In further steps, it may be necessary to work with a subset in order to obtain clearer results. However, the figure S1.1 reveals substantial differences in sequencing depth between samples and it may be considered discarding those samples whose depth is substantially lower than the rest.
We can not distinguish the type of the samples using the color approach because of the huge number of samples so we should work with a subset to observe clearer results. We can only observe that some samples have lower depth than others. To identify which are these samples we may simply look at the actual numbers including portion of the sample identifier that distinguishes them.
sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(se), 6, 12)
s <- sampledepth[sampledepth > 24.6]
sort(s)
AX.A2HC AX.A2H4 AX.A060 B5.A3S1 AX.A3G7 AJ.A3OK AX.A3FX FL.A1YH EO.A3AS
24.7 25.0 25.4 25.7 25.9 26.1 26.2 27.3 28.4
AX.A3GB JU.AAVI FL.A1YQ B5.A3FC DF.A2KZ AX.A3G1 DF.A2L0 AJ.A3BD AJ.A8CT
28.5 28.5 28.8 28.9 28.9 29.0 29.4 29.8 29.9
B5.A3FH BK.A0CA EY.A1GO DI.A1C3 AJ.A3BH AX.A3FW DI.A2QY EO.A3KU DF.A2KV
29.9 30.0 30.0 30.2 30.5 30.5 30.5 31.1 31.3
DI.A2QY D1.A2G0 PG.A917 AJ.A3EJ B5.A3FB FL.A1YV AX.A3GI B5.A0JR A5.A2K4
31.6 31.7 31.7 31.8 31.9 32.2 32.3 32.3 32.7
E6.A2P8 EO.A3KX EY.A2ON QS.A5YQ DF.A2KU AJ.A3OJ AX.A3G4 EO.A3B1 EY.A2OO
32.7 32.8 32.8 33.0 33.2 33.3 33.3 33.3 33.4
AJ.A3NH AP.A3K1 B5.A3F9 A5.A2K2 D1.A3DH AJ.A3BF B5.A3FA BK.A139 D1.A3DG
33.5 33.7 33.7 33.8 33.8 33.9 33.9 34.1 34.3
KP.A3W1 AJ.A3BG BG.A3PP AX.A1C7 DI.A2QT A5.A2K5 AX.A3G3 EO.A3AZ EO.A3AV
34.3 34.4 34.5 34.6 34.6 34.7 34.7 34.7 34.8
K6.A3WQ DF.A2KR BG.A2AD FL.A1YN 2E.A9G8 AX.A3G8 AX.A0J0 AJ.A5DV EY.A3L3
34.8 34.9 34.9 34.9 35.0 35.2 35.3 35.4 35.4
KP.A3W3 FL.A1YF AX.A3G6 EY.A2OP FI.A3PV EO.A3B0 QS.A5YR B5.A3FD AJ.A2QM
35.4 35.5 35.6 35.6 35.6 35.7 35.7 35.8 35.9
EO.A22U BK.A6W4 AX.A3G9 FI.A3PX EO.A22X AJ.A3NC AJ.A3BK AJ.A8CV BG.A3EW
36.0 36.2 36.3 36.4 36.5 36.7 36.9 37.0 37.0
FL.A1YT AX.A3FT BG.A0MK A5.A2K7 4E.A92E AJ.A3EK AX.A2HC AJ.A8CW EO.A1Y7
37.1 37.2 37.3 37.4 37.5 37.5 37.5 37.6 37.6
EY.A2OQ B5.A5OC BK.A56F EO.A3AU EO.A3AY AJ.A5DW B5.A11R EY.A4KR PG.A5BC
37.6 37.9 37.9 37.9 38.0 38.1 38.2 38.2 38.2
E6.A2P9 EO.A3L0 AX.A2IN AX.A05Y FL.A1YI AJ.A3BI DF.A2KY KP.A3W0 FL.A3WE
38.3 38.3 38.7 38.8 38.8 38.9 38.9 38.9 38.9
DF.A2KN EY.A547 E6.A8L9 EY.A54A D1.A3DA FL.A1YL AJ.A3OL AJ.A3EL A5.AB3J
39.0 39.1 39.2 39.2 39.3 39.3 39.5 39.6 39.7
BG.A3PP BK.A13B QF.A5YT PG.A6IB EY.A1GX EO.A22Y BK.A4ZD EY.A1GL EY.A549
39.7 39.8 39.9 40.1 40.2 40.3 40.3 40.5 40.5
AX.A2HD BK.A139 AJ.A3TW AJ.A3I9 EY.A210 KP.A3VZ A5.A3LP EY.A548 EY.A5W2
40.5 41.1 41.4 41.5 41.6 41.7 42.1 42.8 42.8
SJ.A6ZJ AJ.A3QS EY.A1GP B5.A0K9 AJ.A3NC AJ.A3EM D1.A3JP FL.A1YG AJ.A3NE
42.9 43.1 43.1 43.4 43.5 43.8 43.8 44.0 44.1
BK.A4ZD EY.A72D PG.A916 AJ.A3NE AX.A3FS PG.A914 PG.A915 AX.A2HH AX.A3FV
44.3 44.4 44.6 44.7 44.8 44.9 45.2 45.6 45.8
A5.A1OH AJ.A3NH B5.A1MW B5.A5OD FL.A1YU AJ.A3NF AJ.A2QO AX.A3FZ B5.A1MS
46.0 46.0 46.3 46.7 47.0 47.1 47.3 47.3 47.3
AJ.A3IA D1.A3JQ SL.A6JA KP.A3W4 EY.A3QX PG.A7D5 AJ.A3NG AJ.A23N KJ.A3U4
47.9 48.1 48.7 48.8 49.0 49.4 49.5 50.1 50.4
A5.A3LO 5B.A90C B5.A0JN QF.A5YS SJ.A6ZI EO.A3KW BG.A3EW AJ.A6NU QS.A744
51.5 51.8 52.0 53.6 53.6 53.7 54.0 54.2 55.0
A5.A7WJ BK.A6W3 A5.A7WK SL.A6J9 B5.A5OE
55.4 56.9 57.3 58.3 60.1
It is known that in a lot of studies when we want to compare tumoral samples with normal ones those samples came from the same patient and that is the reason why we wanted to know if this is our case. We observe that 23 samples fullfil this condition, so that means that from these patients we have a tumor and normal sample. For this reason we can assume that we have paired data.
We can see that the number of paired samples is the same number as samples catalogued both as female and normals of the first preview of our data, so it makes sense.
Paired data, as we mentioned, comes from experiments where the samples are taken from the same subject so in further steps we could have the advantage that we avoid batch effects and we have a balanced set. To be able to know if a tumor sample and a normal one come from the same patient we have used as an identifier the bcr barcode.
frame_patients <- data.frame(table(colData(se)$bcr_patient_barcode))
table(frame_patients$Freq)
1 2
522 23
#As we can observe, from the data we can only obtain 23 paired samples according to the barcode. Let's apply the filter to obtain the subset.
se_x2 <- se[, colData(se)$bcr_patient_barcode %in% frame_patients$Var1[frame_patients$Freq == 2]] ## %in% operator in R, is used to identify if an element belongs to a vector, in this case, the patients that are found in a frequence of 2.
se_x2
class: RangedSummarizedExperiment
dim: 20115 46
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(2): counts logCPM
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(46): TCGA.AJ.A2QL.01A.11R.A18M.07
TCGA.AJ.A3NC.01A.11R.A22K.07 ... TCGA.DI.A2QY.11A.11R.A19W.07
TCGA.E6.A1M0.11A.11R.A144.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
table(colData(se_x2)$type)
normal tumor
23 23
Let’s examine the sequencing depth again using the subset. Figure 2 below shows the sequencing depth per sample:
[1] 1240081311
[1] 26958289
Figure 2: Library sizes in increasing order (paired data)
Thanks to the subsetting we can see more clearly the graph that shows the depth of coverage for each sample. We can see how the samples are not fully grouped according to their origin (tumor or normal), but they are distributed throughout the graph having high and low coverage values for both normal and tumor samples.
It is necessary to clarify that, in this case, by doing this subsetting, we are ignoring lots of data (the majority), but in cancer studies using paired data allows us to obtain more accurate results, so we will begin with this approach and further analyse the data we exclude if necessary.
Also, usually the mean expression should be another threshold to include or exclude data. We can see 2 thresholds: the bottom one corresponding to the mean expression of the whole dataset, and the one on the top, corresponding to the mean depth of the subset sample. They both cutt-off the subset under the 19 most expressed samples, and it could be considered to filter just those, but for now, we have such a small subset of samples that we are going to mantain all of them, and study below if the low expression affects the analysis, or otherwise keep all these samples.
Let’s look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figures 3, 4, 5
Figure 3: Non-parametric density distribution of expression profiles per sample
Figure 4: Non-parametric density distribution of expression profiles per normal sample
Figure 5: Non-parametric density distribution of expression profiles per tumor sample
The distribution of expression levels across samples allows one to identify samples with distinctive RNA composition. In this case we can not observe differences in either kind of plot: the density plots have the same behaviour in tumor and normal samples and the boxplots behave more or less the same, we can also see that the behavior between samples is also very similar within the plots.
Let’s calculate now the average expression per gene through all the paired samples. Figure 6 shows the distribution of those values across genes.
Figure 6: Distribution of average expression level per gene
The previous log CPM plot represents the typically bimodal distribution, with a low-CPM peak representing the most lowly-expressed genes and a high-CPM peak representing genes with a high level of expression. To filter out the lowly-expressed genes, we should choose a threshold between both peaks, and shown as a red line, logCPM = 1 was chosen as this threshold.
Figure 7: qq-plot
We can see with the previous QQ-plot that, indeed, with a threshold of 1 we pick only the genes that are not lowly-expressed (not the horizontal, negative values from the left) and that we also exclude the first part of the increasing exponentially values, which will allow us to avoid technical artifacts. So we will perform the subsetting of the genes using this threshold.
RNA-seq expression profiles from lowly-expressed genes can lead to artifacts in downstream differential expression analyses, as mentioned. For this reason it is common practice to remove them.
And, as stated beforehand, the cuttoff of 1 log CPM unit was chosen, so all the genes with lower expression will not be considered.
[1] 20115 46
[1] 11638 46
Figure 8: Filtering of lowly-expressed genes
In the figure 8 we can see a summary and a visual representation of the genes we kept: just the ones surpassing the cut-off of 1 (a little bit more than half the initial number of genes).
We can store un-normalized versions of the filtered expression data.
saveRDS(se.filt, "results/se.filt.unnorm.rds")
saveRDS(dge.filt, "results/dge.filt.unnorm.rds")
We calculate now the normalization factors on the filtered expression data set. The calcNormFactors is an EdgeR function to perform the between normalization with Trimmed Mean of M-values (TMM).
dge.filt <- calcNormFactors(dge.filt) #EdgeR function to performe the between normalization TMM.
Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.
assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25) #Within normalization.
Store normalized versions of the filtered expression data.
saveRDS(se.filt, "results/se.filt.rds")
saveRDS(dge.filt, "results/dge.filt.rds")
MA-plots are useful to compare two group of samples. It concludes how different the two samples are in terms of read counts in RNA-seq experiments. So, we compare one sample at a time against the average of the rest of the samples
We examine now the MA-plots of the normalized expression profiles. Blue line is a reference of what we expect, red line indicates if we are above or below.
We look first to the tumor samples in Figure 9.
Figure 9: MA-plots of the tumor samples
We do not observe any sample that is very different from the others and that therefore has a strange behavior, sometimes some samples biases but normally the red line (mean of normalized values) is closer to the blue line (expected line).
Let’s look now to the normal samples in Figure 10.
Figure 10: MA-plots of the tumor samples
In this case we can observe that some of the normal samples have lines whome tails are not exactly fitting the expected line: maybe can cause problems in further steps of the analysis.
Batch effects can occur because measurements are affected by laboratory conditions, reagent lots, and personnel differences. This becomes a major problem when batch effects are confounded with an outcome of interest and lead to incorrect conclusions. So we have to detect the batch effect to know if the results that we’re going to obtain are reliable.
We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples. Also we obtained all the codes definition in https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/.
ELEMENTS DATA:
TSS (tissue source site):
tss <- substr(colnames(se.filt), 6, 7)
table(tss) #Tissue source site
tss
AJ AX BG BK DI E6
8 20 6 6 4 2
Center in which the samples are sequenced:
center <- substr(colnames(se.filt), 27, 28)
table(center) #Center in which the samples are sequenced
center
07
46
Plates in which the samples are sequenced:
plate <- substr(colnames(se.filt), 22, 25) #Plates in which the samples are sequenced
table(plate)
plate
A00V A104 A109 A118 A137 A144 A16F A17B A18M A19W A22K A27V A32Y
1 2 1 3 6 2 2 4 6 2 10 5 2
Analyte used to sequence:
portionanalyte <- substr(colnames(se.filt), 18, 20) #Analyte used to sequence
table(portionanalyte)
portionanalyte
11R 12R 21R 22R 32R 33R
38 3 2 1 1 1
Sample type:
samplevial <- substr(colnames(se.filt), 14, 16) #Type to which the samples belong
table(samplevial)
samplevial
01A 11A
23 23
From this information we can make the following observations:
All samples were sequenced at the same center with the 07 code (University of North Carolina)
All samples belong to one of two combinations of tissue type and vial, matching the expected tumor and normal distribution. (samplevial). Code definition for 01A : Primary Solid Tumor also called TP Code definition for 11A: Solid Tissue Normal.
Samples from Uterine Corpus Endometrial Carcinoma were collected across different tissue source sites (TSS). AJ: International Genomics Conosrtium AX: Gynecologic Oncology Group BG: University of Pittsburgh BK: Christiana Healthcare DI: MD Anderson E6: Roswell Park
The samples are sequenced within different plates.
The samples were not sequenced using the same analyte combinations.
Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with TSS, plate and portionalyte.
table(data.frame(TYPE=se.filt$type, TSS=tss))
TSS
TYPE AJ AX BG BK DI E6
normal 4 10 3 3 2 1
tumor 4 10 3 3 2 1
table(data.frame(TYPE=se.filt$type, plate=plate))
plate
TYPE A00V A104 A109 A118 A137 A144 A16F A17B A18M A19W A22K A27V A32Y
normal 0 1 0 1 3 1 1 1 3 1 5 4 2
tumor 1 1 1 2 3 1 1 3 3 1 5 1 0
table(data.frame(TYPE=se.filt$type, portionanalyte=portionanalyte))
portionanalyte
TYPE 11R 12R 21R 22R 32R 33R
normal 20 1 0 1 0 1
tumor 18 2 2 0 1 0
When we classify the tumor and normal samples according to the TSS we can see that the number of tumor samples and the number of normal samples is the same for each TSS. This indicates that the tumor sample and the normal sample of each patient have been obtained in the same batch. Therefore, in this case the batch effect due to differences in the TSS are not affecting the data and is not a a source of expression variability.
To verify that the the TSS is not causing batch effects it is better to examine how the samples are grouped by hierarchical clustering and multidimensional scaling. We are going to repeat this verification with the other two surrogates.
First, we annotated the outcome of interest and the surrogate of batch indicator. We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts.
logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(tss))
batch2 <- as.integer(factor(plate))
batch3 <- as.integer(factor(portionanalyte))
Next we created the dendongram with the representation of the Hierarchical clustering of the samples for each of the surrogates. The resulting dendrograms are shown in Figures 11, 12, 13.
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples (TSS surrogate)")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(tss))), fill=sort(unique(batch)))
Figure 11: Hierarchical clustering of samples (TSS surrogate)
sampleDendrogram2 <- as.dendrogram(sampleClustering, hang=0.1)
names(batch2) <- colnames(se.filt)
outcome2 <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome2) <- colnames(se.filt)
sampleDendrogram2 <- dendrapply(sampleDendrogram2,
function(x, batch2, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch2[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch2, outcome2)
plot(sampleDendrogram2, main="Hierarchical clustering of samples (Plate surrogate)")
legend("topright", paste("Batch", sort(unique(batch2)), levels(factor(plate))), fill=sort(unique(batch2)))
Figure 12: Hierarchical clustering of samples (Plate surrogate)
sampleDendrogram3 <- as.dendrogram(sampleClustering, hang=0.1)
names(batch3) <- colnames(se.filt)
outcome3 <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome3) <- colnames(se.filt)
sampleDendrogram3 <- dendrapply(sampleDendrogram3,
function(x, batch3, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch3[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch3, outcome3)
plot(sampleDendrogram3, main="Hierarchical clustering of samples (Portionanalyte surrogate)")
legend("topright", paste("Batch", sort(unique(batch3)), levels(factor(portionanalyte))), fill=sort(unique(batch3)))
Figure 13: Hierarchical clustering of the samples (Portionanalyte surrogate)
One of the main observations that can be seen in the dendrogram is that the samples are grouped according to whether they are tumoral and normal, and therefore there is no batch effect observed due to TSS, plate and portionalyte.
library(RColorBrewer)
type <- as.integer(factor(se.filt$type))
plotMDS(dge.filt, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(tss))),
fill=sort(unique(batch)), inset=0.05)
Figure 14: Multidimensional scaling plot of the samples (TSS surrogate)
plotMDS(dge.filt, labels=outcome, col=batch2)
legend("bottomright", paste("Batch", sort(unique(batch2)), levels(factor(plate))),
fill=sort(unique(batch2)), inset=0.05)
Figure 15: Multidimensional scaling plot of the samples (Plate surrogate)
plotMDS(dge.filt, labels=outcome, col=batch3)
legend("bottomleft", paste("Batch", sort(unique(batch3)), levels(factor(portionanalyte))),
fill=sort(unique(batch3)), inset=0.05)
Figure 16: Multidimensional scaling plot of the samples (Portionanalyte surrogate)
In the multidimensional plot (Figures 14, 15, 16) we can see how the samples are grouped according to whether they are normal or tumoral. We notice that there is a sample that is separated from the rest, A2HC-tumor. It is clear that it is not clustering with all the other tumor samples, but it is also clear that it is not due to batch effect (in any of three cases studied), so we are going to keep the sample by now, but keeping in mind that a further study of this differential behaviour would be interesting.
In the next step we are going to identify changes in gene expression. The analysis will allow us to answer the next question:
“What is the number of genes that change their activity between tumor and control samples?”
And in further studies we will know what genes are the ones that change their activity.
We are going to work with the normalized gene expression data set that we have obtained in the step before.
We perform a simple examination of expression changes and their associated p-values using the R/Bioconductor package sva.
Surrogate variable analysis (SVA) is a technique that tries to capture sources of heterogeneity in high-throughput profiling data, such as non-biological variability introduced by batch effects.
SVA provides a function to quickly perform F-tests for detecting genes significantly changing between the full and null models. This enables a quick overview of the impact of adjusting for the estimated heterogeneity.
library(sva)
mod <- model.matrix(~ se.filt$type, colData(se.filt))
mod0 <- model.matrix(~ 1, colData(se.filt))
pv <- f.pvalue(assays(se.filt)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.01)
[1] 5570
There are 5570 genes significantly changing their expression at FDR < 1%. In Figure 17 below we show the distribution of the resulting p-values.
Figure 17: Distribution of raw p-values for an F-test on every gene between tumor and normal samples
In the graph we can see the distribution of raw p-values for an F-test on every gene between tumor and normal samples.
A distribution of p -values under the null hypothesis should be uniform. Assuming that most genes are not differentially expressed (DE), the previous histogram should have looked uniform with a peak at low p -values for the truly DE genes. Departures from this indicate data heterogeneity.
sva() function.sv <- sva(assays(se.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5
sv$n
[1] 11
The SVA algorithm has found 11 surrogate variables. Let’s use them to assess again the extent of differential expression this time adjusting for these surrogate variables.
modsv <- cbind(mod, sv$sv)
mod0sv <- cbind(mod0, sv$sv)
pvsv <- f.pvalue(assays(se.filt)$logCPM, modsv, mod0sv)
sum(p.adjust(pvsv, method="fdr") < 0.01)
[1] 7040
We have increased the number of DE genes (increase of statistical power) to 7040, as we can see, 1470 more genes have been detected as differentially expressed than before (5570 genes). Figure 18 shows the resulting distribution of p-values.
Figure 18: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA
After adjusting for surrogate variables using SVA the p -value distribution is more uniform now for the non-DE genes.
In this section we can examine gene-by-gene changes in expression by comparing expression values between two groups of samples, tumoral and normal ones.
Calculate log fold-changes between tumoral and normal samples for all genes using the rowMeans() function:
Figure 19: MA-plot
In the plot 19 Normal vs. Tumor, if all genes were equally expressed in both samples, the dots (each dot represents one gene) would map a perfect diagonal line. We could observe, though, that there are dots outside the line. It is normal to have some area over and above the line, but it is clear that there are some points way outside the diagonal. Those one would be the ones we are looking for, to see if they are statistically differentially expressed.
In the left, the dots over the diagonal would be those overexpressed in normal and underexpressed in tumor samples, and the ones under the diagonal, the other way around.
In the right plot, which shows the same idea, as the y axis is a tumor-normal take away, all dots over the horizontal 0 line will be overexpressed in tumor samples and underexpressed in normal ones, and the other way around with the dots far away and under the 0 threshold.
We used the volcano plot (https://en.wikipedia.org/wiki/Volcano_plot_(statistics) in order to plot the raw p-values as function of their fold-changes,both in logarithmic scale, 20.
Figure 20: Volcano plot
In the previous figure we can see, in red, the differentially overexpressed (to the right) and underexpressed (to the left) genes, although it’s just a first visualization idea, based on the log fold-change approach, and further analysis will be done in the genes to verify this.
A first straightforward approach to differential expression analysis is to simply rank genes by decreasing absolute value of log 2 fold changes and call DE those at the top (e.g., the top 10) of the ranking:
log2fc <- tumExp - normalExp
ranking <- order(abs(log2fc), decreasing = TRUE)
head(data.frame(Log2FC = round(log2fc[ranking], digits = 3), FC = round(2^log2fc[ranking],
digits = 3), `1/FC` = round(2^(-log2fc[ranking]), digits = 3), row.names = rowData(se.filt)$symbol[ranking],
check.names = FALSE), n = 10)
Log2FC FC 1/FC
MPLKIP -9.201 0.002 588.432
ADH1B -8.513 0.003 365.353
DFFA -8.454 0.003 350.699
PIK3C2A -8.143 0.004 282.677
RNU5E-1 7.175 144.489 0.007
OR4K5 -6.982 0.008 126.444
MIR4298 -6.919 0.008 121.001
SAPCD2 -6.910 0.008 120.233
CNNM1 -6.679 0.010 102.486
XG -6.549 0.011 93.632
Example of result: MPLKIP is 588.432 more expressed in tumor samples than in normal ones so we can say that is an overexpressed gene in the cancer that we are studying.
We are going to perform a curated DE analysis using a linear regression analysis, a statistical technique that attempts to explore and model the relationship between two or more variables. We will use the package of limma in order to perform the linear model.
A general workflow with limma (http://bioconductor.org/packages/limma) for RNA-seq data consists of the following steps: 1. Build design matrix: model.matrix() 2. Calculate observational-level weights (if needed): voom() 3. Estimate correlation in repeated measurements of a blocking factor (if needed): duplicateCorrelation() 4. Fit linear model: lmFit() 5. Build contrast matrix (if needed): makeContrasts() 6. Fit contrasts (if needed): contrasts.fit() 7. Calculate moderated t -statistics ( trend=TRUE if needed): eBayes() 8. Output results: topTable()
We are not to follow all the mentioned steps of the general workflow. Below we will mention the process carried out and the results obtained.
RNA-seq counts may vary considerably from sample to sample for the same gene and different samples may be sequenced to different depths. This may lead to identical CPM values for very different count sizes and motivates the need to model the mean-variance trend of log-CPM values at the individual observation level.
Two ways to incorporate the mean-variance relationship into DE analysis with linear models:
A paired design arises from experiments where there are two measurements per individual and thus one speaks of paired measurements. This is also a common setting when each individual already carries the two conditions of interest (e.g., tissues), and therefore, we can exploit this circumstance to adjust for any characteristics that may affect measurements (e.g., sex, age, BMI, genetic background). To set a paired design, the formula that builds the design matrix should include the pair indicator variable.
table(substr(se.filt$bcr_patient_barcode,1,12))
TCGA-AJ-A2QL TCGA-AJ-A3NC TCGA-AJ-A3NE TCGA-AJ-A3NH TCGA-AX-A05Y
2 2 2 2 2
TCGA-AX-A0IZ TCGA-AX-A0J0 TCGA-AX-A1CF TCGA-AX-A1CI TCGA-AX-A1CK
2 2 2 2 2
TCGA-AX-A2H8 TCGA-AX-A2HA TCGA-AX-A2HC TCGA-AX-A2HD TCGA-BG-A2AD
2 2 2 2 2
TCGA-BG-A3EW TCGA-BG-A3PP TCGA-BK-A0CB TCGA-BK-A13C TCGA-BK-A4ZD
2 2 2 2 2
TCGA-DI-A2QU TCGA-DI-A2QY TCGA-E6-A1M0
2 2 2
In the previous table, the colums are the samples (the 23 individuals we have been working with) and the row is the number of samples for individual (corroborating that the model will take 2 samples per individual, the tumor and normal ones).
When building the design matrix by default we could observe that the reference level of the factor variable is normal. We decided to change it to tumor so that positive values of logFC would mean genes overexpressed in cancer:
se.filt$type <- relevel(se.filt$type, ref="tumor")
Use limma to build the design matrix, fit the model, calculate the moderate t-statistic and obtain the DE genes.
library(limma) #Allow us to obtain differentially expressed genes using linear regresion models.
#Build the design matrix.
mod<- model.matrix(~type + substr(se.filt$bcr_patient_barcode, 1, 12), data = colData(se.filt))
head(mod, n=3)
(Intercept) typenormal
TCGA.AJ.A2QL.01A.11R.A18M.07 1 0
TCGA.AJ.A3NC.01A.11R.A22K.07 1 0
TCGA.AJ.A3NE.01A.11R.A22K.07 1 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 1
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 1
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
# substr(se.filt$bcr_patient_barcode, 1, 12) is the same than just bcr_patient_barcode
#Fit the linear model with the CPM values
fit1 <- lmFit(assays(se.filt)$logCPM, mod)
#Calculate moderated t -statistics using the mean-variance trend:
fit1 <- eBayes(fit1, trend = TRUE)
#Examine the extent of differential expression:
FDRcutoff <- 0.01
res1 <- decideTests(fit1, p.value = FDRcutoff)
summary(res1)
(Intercept) typenormal
Down 4 2819
NotSig 1707 5974
Up 9927 2845
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
Down 0
NotSig 11638
Up 0
dim(res1)
[1] 11638 24
The first block of data represents the coefficients of the model generated (the intercept) and a global summary of what kind of expression we have, compared to tumor type samples. So, around 3000 upstream and downstream regulated genes, and around 6000 for not significantly DE genes.
And the last row is a summary, with our total gene set (around 12000, which fits the 2*3000+6000 summary from the beggining) and 24 columns in the table. Being the first column the block explained before and each of the others a summary for each individual, where the columns is every ID, and the rows are their specific up, down regulated and not significantly expressed genes.
For the genes, we can also take a look at some meta-data, such as the chromosome they are located in:
Add gene meta data:
#Add gene metadata and fetch table of results:
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[,1], stringsAsFactors = FALSE)
fit1$genes <- genesmd
tt1 <- topTable(fit1, coef = 2, n = Inf)
head(tt1)
chr symbol logFC AveExpr t P.Value adj.P.Val
663 chr15 BNIP2 5.903993 4.499887 24.21722 3.902416e-19 2.816480e-15
7251 chr11 TSG101 4.115254 5.904267 24.00524 4.840145e-19 2.816480e-15
390197 chr11 OR4D10 -6.216636 3.637634 -22.59390 2.125834e-18 8.246820e-15
144348 chr12 ZNF664 -5.597645 3.454581 -20.22987 3.092343e-17 8.997173e-14
221692 chr6 PHACTR1 -4.735478 1.851592 -19.90986 4.538549e-17 1.056393e-13
55129 chr3 ANO10 -3.765250 3.464546 -19.61277 6.512421e-17 1.263193e-13
B
663 33.64026
7251 33.43565
390197 32.02226
144348 29.43530
221692 29.06176
55129 28.70961
sort(table(tt1$chr[tt1$adj.P.Val < FDRcutoff]), decreasing = TRUE)
chr1 chr19 chr11
597 405 396
chr2 chr3 chr17
381 325 320
chr12 chr7 chr9
295 282 242
chr5 chr6 chr16
241 240 233
chrX chr14 chr10
207 202 192
chr8 chr4 chr15
188 182 176
chr20 chr22 chr13
175 121 105
chr21 chr18 chrY
78 71 6
chr1_KI270766v1_alt chr17_GL000258v2_alt chr19_KI270921v1_alt
1 1 1
chrUn_GL000218v1
1
DEgenes1 <- rownames(tt1)[tt1$adj.P.Val < FDRcutoff]
DEgenes_symbol1 <- tt1$symbol[tt1$adj.P.Val < FDRcutoff]
length(DEgenes1)
[1] 5664
The importance of knowing in which chromosome the genes are located may be that it could happen that a specific or a few chromosomes are affecting more to this kind of cancer than others. We will take special attention to the chromosomes of the final differentially expressed genes, but, as an overview, we can see that chromosome 1 is the one with most genes involved.
The other important comment about the last table is that there are 12 genes related to Y chromosome, but this samples were all taken from females, so no chromosome Y should had appeared. We will consider it just as a mapping mistake of the reads, but we don’t consider it’s going to affect the analysis, as it’s the chromosome with less genes associated.
And the same happens with the last 4 chromosomes. Some reads were map to those specific chromosomes, that were not correctly identified by the software. In the 3 first cases we can see that the chromosome is annotated, but in the last, not even so. As there is just one gene in each case, we are also not going to consider that as a problem.
Examine diagnostic plots for limma DE analysis with limma-trend in figure 21:
Figure 21: Diagnostic plots for Limma-trend DE analysis
In the previous figure we can visualize the limma-trend results. On the left histogram we can observe a uniformity with a peak at low p-values for the truly DE genes.
In the plot on the right, the black dotted line is formed by all the genes found in this approach that are DE genes.
We are going to perform either the limma-voom approach and for the adjusment of unknown covariates, and compare afterwards the results, to see which one results better for our samples.
Limma-voom 22: calculate precision weights for each gene-by-sample normalized expression value using a new function voom(). Use the resulting matrix of precision weights, jointly with the matrix ofexpression values, to fit the linear models, i.e., in the call to lmFit().
Figure 22: Voom:Mean variance trend
We can observe the mean-variance relation at gene level. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline.
Differential expression is then performed as usual by using limma package.
#Fit the linear model using the voom weights:
fit_v <- lmFit(v, mod)
#Calculate the moderated t -statistics:
fit_v <- eBayes(fit_v)
#Examine the extent of differential expression at 1% FDR:
res_v <- decideTests(fit_v, p.value=FDRcutoff)
summary(res_v)
(Intercept) typenormal
Down 8 2961
NotSig 1871 5971
Up 9759 2706
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
Down 0
NotSig 11637
Up 1
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
Down 0
NotSig 11638
Up 0
#Addition of gene metadata and fetch table of results:
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[,1], stringsAsFactors = FALSE)
fit_v$genes <- genesmd
tt_v <- topTable(fit_v, coef = 2, n = Inf)
head(tt_v)
chr symbol logFC AveExpr t P.Value adj.P.Val
663 chr15 BNIP2 5.839693 4.504789 23.09178 9.879978e-19 1.149832e-14
7251 chr11 TSG101 4.029071 5.905098 22.25970 2.433464e-18 1.416033e-14
390197 chr11 OR4D10 -6.101170 3.648108 -21.51151 5.619566e-18 2.180017e-14
779 chr1 CACNA1S 4.668190 5.161293 20.04532 3.134104e-17 7.294941e-14
221692 chr6 PHACTR1 -4.650689 1.871502 -20.09904 2.937143e-17 7.294941e-14
55129 chr3 ANO10 -3.787187 3.468596 -19.67843 4.903007e-17 9.510199e-14
B
663 32.56053
7251 31.98948
390197 30.81490
779 29.43487
221692 29.10236
55129 28.87207
DEgenes_v <- rownames(tt_v)[tt_v$adj.P.Val < FDRcutoff]
DEgenes_symbolv <- tt_v$symbol[tt_v$adj.P.Val < FDRcutoff]
length(DEgenes_v)
[1] 5667
sort(table(tt_v$chr[tt_v$adj.P.Val < FDRcutoff]), decreasing = TRUE)
chr1 chr19 chr11
585 401 399
chr2 chr3 chr17
389 331 328
chr12 chr7 chr5
298 282 242
chr6 chr9 chr16
239 237 229
chrX chr14 chr10
211 205 194
chr4 chr8 chr20
189 183 175
chr15 chr22 chr13
170 116 105
chr21 chr18 chrY
77 73 5
chr1_KI270766v1_alt chr17_GL000258v2_alt chr19_KI270921v1_alt
1 1 1
chrUn_GL000218v1
1
After the same previous steps (generating the matrix model), we can now visualize the results.
Examine diagnostic plots for limma DE analysis with voom weights 23:
Figure 23: Diagnostic plots for Limma-voom DE analysis
Same as with the limma-trend approach, we can see, first, a diagram of p-values with a peak for the low ones (differentially expressed) and a more or less constant frequency for the other ones. This means that the approach is good.
In the right plot, we can also observe similar results than in the limma-trend one.
The next step in our differential expression ananlysis consists in correcting the model for unkown covariates, adding them to the model.
mod<- model.matrix(~type + substr(se.filt$bcr_patient_barcode, 1, 12), data = colData(se.filt))
mod0 <- model.matrix(~ substr(se.filt$bcr_patient_barcode, 1, 12), colData(se.filt))
#estimation of surrogate variables from the log-CPM values calculated by voom
sv <- sva(v$E, mod = mod, mod0=mod0)
Number of significant surrogate variables is: 7
Iteration (out of 5 ):1 2 3 4 5
sv$n
[1] 7
mod_u <- cbind(mod, sv$sv)
colnames(mod_u) <- c(colnames(mod_u)[1:24], paste0("SV", 1:sv$n))
head(mod_u,n=5)
(Intercept) typenormal
TCGA.AJ.A2QL.01A.11R.A18M.07 1 0
TCGA.AJ.A3NC.01A.11R.A22K.07 1 0
TCGA.AJ.A3NE.01A.11R.A22K.07 1 0
TCGA.AJ.A3NH.01A.11R.A22K.07 1 0
TCGA.AX.A05Y.01A.11R.A00V.07 1 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 1
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 1
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 1
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 1
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
TCGA.AJ.A3NH.01A.11R.A22K.07 0
TCGA.AX.A05Y.01A.11R.A00V.07 0
SV1 SV2 SV3
TCGA.AJ.A2QL.01A.11R.A18M.07 0.136756052 -0.11169669 -0.17005068
TCGA.AJ.A3NC.01A.11R.A22K.07 0.103196313 0.07265497 -0.10258811
TCGA.AJ.A3NE.01A.11R.A22K.07 -0.008643974 0.14154657 -0.16342623
TCGA.AJ.A3NH.01A.11R.A22K.07 0.029027736 -0.04019069 -0.09629501
TCGA.AX.A05Y.01A.11R.A00V.07 0.090712161 0.04723105 0.21411584
SV4 SV5 SV6
TCGA.AJ.A2QL.01A.11R.A18M.07 0.09681766 -0.346170514 0.003583838
TCGA.AJ.A3NC.01A.11R.A22K.07 0.31508178 -0.006150784 0.091382487
TCGA.AJ.A3NE.01A.11R.A22K.07 0.15806919 0.198620038 0.220746254
TCGA.AJ.A3NH.01A.11R.A22K.07 -0.11195618 0.207936216 -0.132124399
TCGA.AX.A05Y.01A.11R.A00V.07 -0.35021738 -0.281424109 0.058073372
SV7
TCGA.AJ.A2QL.01A.11R.A18M.07 -0.008817287
TCGA.AJ.A3NC.01A.11R.A22K.07 -0.015456258
TCGA.AJ.A3NE.01A.11R.A22K.07 -0.387483932
TCGA.AJ.A3NH.01A.11R.A22K.07 0.135429475
TCGA.AX.A05Y.01A.11R.A00V.07 0.028333391
There are 9 SVs. Second, we add these SVs to the design matrix, which has the following form:
fit_u <- lmFit(v, mod_u)
fit_u <- eBayes(fit_u)
We examine the extent of differential expression at 1% FDR:
res_u <- decideTests(fit_u, p.value = FDRcutoff)
summary(res_u)
(Intercept) typenormal
Down 1 3371
NotSig 1643 5098
Up 9994 3169
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
Down 0
NotSig 11637
Up 1
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
Down 0
NotSig 11637
Up 1
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
Down 1
NotSig 11637
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
Down 0
NotSig 11638
Up 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0 SV1 SV2
Down 0 706 228
NotSig 11638 9968 10965
Up 0 964 445
SV3 SV4 SV5 SV6 SV7
Down 476 19 2 19 0
NotSig 10590 11605 11636 11609 11638
Up 572 14 0 10 0
#Finally, the metadata is added, the table of results is computed and the number of DE genes is calculated:
fit_u$genes <- genesmd
tt_u <- topTable(fit_u, coef = 2, n = Inf)
saveRDS(tt_u, file.path("results", "tt_u.rds"))
DEgenes_u <- rownames(tt_u)[tt_u$adj.P.Val < FDRcutoff]
DEgenes_symbol_u <- tt_u$symbol[tt_u$adj.P.Val < FDRcutoff]
length(DEgenes_u)
[1] 6540
Though it may not look significant, the DE genes have globally increased, as there are more or less 3500 up and down regulated genes, and just 5000 not DE.
We can also observe the chromosome distribution of DE genes.
sort(table(tt_u$chr[tt_u$adj.P.Val < FDRcutoff]), decreasing = TRUE)
chr1 chr19 chr11
682 466 452
chr2 chr3 chr17
440 380 365
chr12 chr7 chr6
339 317 290
chr5 chr16 chr9
286 273 269
chrX chr14 chr4
244 239 226
chr10 chr8 chr20
221 218 204
chr15 chr22 chr13
193 142 121
chr18 chr21 chrY
81 80 8
chr1_KI270766v1_alt chr17_GL000258v2_alt chr19_KI270921v1_alt
1 1 1
chrUn_GL000218v1
1
head(mod, n=3)
(Intercept) typenormal
TCGA.AJ.A2QL.01A.11R.A18M.07 1 0
TCGA.AJ.A3NC.01A.11R.A22K.07 1 0
TCGA.AJ.A3NE.01A.11R.A22K.07 1 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NC
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 1
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NE
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 1
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AJ-A3NH
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A05Y
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0IZ
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A0J0
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CF
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CI
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A1CK
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2H8
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HA
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HC
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-AX-A2HD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A2AD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3EW
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BG-A3PP
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A0CB
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A13C
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-BK-A4ZD
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QU
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-DI-A2QY
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
substr(se.filt$bcr_patient_barcode, 1, 12)TCGA-E6-A1M0
TCGA.AJ.A2QL.01A.11R.A18M.07 0
TCGA.AJ.A3NC.01A.11R.A22K.07 0
TCGA.AJ.A3NE.01A.11R.A22K.07 0
We can also examine the corresponding plots of this model in @ref(fig: diagnostic-plot-Adjust-for-unknown-covariates:
Figure 24: Diagnostic plots for Adjust for unknown covariates
The plots look similar again. In the left figure we see highly differenciated peak for DE genes. And in the right one, we can see black dots 20 quantiles far away from the red line. We are specially interested in the standing alone dots of the extrems.
The so-called volcano plot is a widely used diagnostic plot in DE analysis to assess the extent of DE by plotting the raw p-values as function of their fold-changes, both in logarithmic scale.
As all approaches were more or less the same, we can visualize the most differentially expressed genes in all of them, with a volcano plot. We showed before the figure, but here we can show the specific most DE genes.
Volcano plot for Model 1 25:
Figure 25: Model 1 - Volcano plots of DE
chr symbol logFC AveExpr t P.Value adj.P.Val
663 chr15 BNIP2 5.903993 4.499887 24.21722 3.902416e-19 2.816480e-15
7251 chr11 TSG101 4.115254 5.904267 24.00524 4.840145e-19 2.816480e-15
390197 chr11 OR4D10 -6.216636 3.637634 -22.59390 2.125834e-18 8.246820e-15
144348 chr12 ZNF664 -5.597645 3.454581 -20.22987 3.092343e-17 8.997173e-14
221692 chr6 PHACTR1 -4.735478 1.851592 -19.90986 4.538549e-17 1.056393e-13
B
663 33.64026
7251 33.43565
390197 32.02226
144348 29.43530
221692 29.06176
Volcano plot for Model 2 26:
Figure 26: Model 2 - Volcano plots of DE
chr symbol logFC AveExpr t P.Value adj.P.Val
663 chr15 BNIP2 5.839693 4.504789 23.09178 9.879978e-19 1.149832e-14
7251 chr11 TSG101 4.029071 5.905098 22.25970 2.433464e-18 1.416033e-14
390197 chr11 OR4D10 -6.101170 3.648108 -21.51151 5.619566e-18 2.180017e-14
779 chr1 CACNA1S 4.668190 5.161293 20.04532 3.134104e-17 7.294941e-14
221692 chr6 PHACTR1 -4.650689 1.871502 -20.09904 2.937143e-17 7.294941e-14
B
663 32.56053
7251 31.98948
390197 30.81490
779 29.43487
221692 29.10236
Volcano plot for Model 3 27:
Figure 27: Model 3 - Volcano plots of DE
chr symbol logFC AveExpr t P.Value adj.P.Val
23094 chr19 SIPA1L3 6.273280 4.374782 26.42984 2.255357e-16 1.438171e-12
390197 chr11 OR4D10 -6.185478 3.648108 -26.14472 2.753457e-16 1.438171e-12
779 chr1 CACNA1S 4.546686 5.161293 25.72513 3.707263e-16 1.438171e-12
663 chr15 BNIP2 5.965787 4.504789 25.25117 5.216656e-16 1.517786e-12
81626 chr1 SHCBP1L -2.171171 7.764988 -24.61222 8.348164e-16 1.943119e-12
B
23094 27.35205
390197 27.16273
779 27.14925
663 26.63682
81626 26.44553
The volcano plots for the three models look quite similar, but we can see that the top DE genes are not the same. Nevertheless, it is important to considerate that some top genes do appear in all the figures, which makes them more likely to be correctly identified. For the overexpressed top genes (in the right half of the figures), BNIP2 appears in all of them, while CACNA1S and TSG10 appear in two out of three. On the other hand, for the underexpressed genes, OR4D10 is also robust in all the models.
The previous figure 28 is another way of showing the results for model 3. This time, overexpressed genes are in black over the 0 horizontal threshold, and the underexpressed are below it. The red dots are the non-DE genes.
Figure 28: MA plot for the Model 3
As we can observe in the different examination diagnostic plots the results obtained with the different models are similar. However Model 3 (Adjust for unknown covariates) is the one that is going to be used for the Funtional Enrichment Analysis as the outcome of interest is not confounded with other sources of variation. With this method we could detect the highest number of DE genes (6540).
length(DEgenes1)
[1] 5664
length(DEgenes_v)
[1] 5667
length(DEgenes_u)
[1] 6540
saveRDS(DEgenes_u, file.path("results", "DEgenes.rds"))
Given a list of differentially expressed (DE) genes, we may know the function and role of some of those genes within the molecular process under study. This may already shed light on what is being investigated. However, even if we knew what every DE gene is doing, we would need to frame their activity within the pathways in which they are participating to make a hypothesis about why they are changing.
There are several R packages at CRAN/Bioconductor that facilitate performing a functional enrichment analysis on the entire collection of GO gene sets. We are going to illustrate this analysis with the Bioconductor package GOstats (http://www.bioconductor.org/packages/release/bioc/html/GOstats.html). Doing this analysis with GOstats (http://www.bioconductor.org/packages/release/bioc/html/GOstats.html) consists of the following three steps: 1. Build a parameter object with information specifying the gene universe, the set of DE genes, the annotation package to use, etc. 2. Run the functional enrichment analysis. 3. Store and visualize the results.
library(GOstats)
Loading required package: Category
Loading required package: Matrix
Attaching package: 'Matrix'
The following object is masked from 'package:S4Vectors':
expand
Loading required package: graph
Attaching package: 'graph'
The following object is masked from 'package:XML':
addNode
Attaching package: 'GOstats'
The following object is masked from 'package:AnnotationDbi':
makeGOGraph
library(org.Hs.eg.db)
#Gene Universe
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11638
#Differential expressed genes:
length(DEgenes_u)
[1] 6540
#Overexpressed genes in tumor samples:
over_DEgenes <- rownames(tt_u)[tt_u$adj.P.Val < FDRcutoff & tt_u$logFC > 0]
length(over_DEgenes)
[1] 3169
#Underexpressed genes in tumor samples:
under_DEgenes <- rownames(tt_u)[tt_u$adj.P.Val < FDRcutoff & tt_u$logFC < 0]
length(under_DEgenes)
[1] 3371
params <- new("GOHyperGParams", geneIds=DEgenes_u, universeGeneIds=geneUniverse, annotation="org.Hs.eg.db", ontology="BP",pvalueCutoff=0.05,testDirection="over")
over_params <- new("GOHyperGParams", geneIds=over_DEgenes,universeGeneIds=geneUniverse,annotation="org.Hs.eg.db", ontology="BP", pvalueCutoff=0.05, testDirection="over")
under_params <- new("GOHyperGParams", geneIds=under_DEgenes,universeGeneIds=geneUniverse,annotation="org.Hs.eg.db", ontology="BP", pvalueCutoff=0.05, testDirection="over")
The four previous parameters represent: - Gene universe (N): the total, filtered genes we considered in the DE analysis (11638). - DE genes (m): the result from the previous analysis, the statistically significant DE genes (6540). - The overexpressed genes, a subset of the DE genes (as mentioned, possitive logFC, which means they appear in tumor samples more frequently than in normal ones), with 3169 in total. - The underexpressed genes. Same as before but for negative logFC values. This genes are less frequent in tumor samples (3398).
conditional(params) <- TRUE
hgOver <- hyperGTest(params)
hgOver
Gene to GO BP Conditional test for over-representation
12164 GO BP ids tested (187 have p < 0.05)
Selected gene set size: 5329
Gene universe size: 9394
Annotation package: org.Hs.eg
#Store the html
htmlReport(hgOver, file = "gotests.html")
browseURL("gotests.html")
We can see that the gene univers is smaller than before because not all the gens have GO anotations or ids, so the package could not identify or annotate all of them.
After filtering genes without GO id, we can get a first look to the results:
#Visualize the results in R:
goresults <- summary(hgOver)
head(goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0008203 0.0002070372 2.639067 40.27667 55 71
2 GO:0016331 0.0003739635 2.303836 47.65127 63 84
3 GO:1902653 0.0003778520 4.443019 19.28742 29 34
4 GO:0006066 0.0004892316 1.705489 102.10986 124 180
5 GO:0060627 0.0013217691 1.476800 155.43389 180 274
6 GO:0016126 0.0016370289 3.282021 20.98925 30 37
Term
1 cholesterol metabolic process
2 morphogenesis of embryonic epithelium
3 secondary alcohol biosynthetic process
4 alcohol metabolic process
5 regulation of vesicle-mediated transport
6 sterol biosynthetic process
#To try to spot the more interesting and reliable GO terms we can filter the previous results by a minimum value on the Count and Size columns, a maximum Count value, and order them by the OddsRatio column:
#We used 10 for the counts because the number of count was bigger than 10.
goresults <- goresults[goresults$Size >= 10 & goresults$Size <= 300 & goresults$Count >= 10, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
#Store the html (filtered ones)
dim(goresults)
[1] 117 7
head(goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
51 GO:0032700 0.018333292 7.640534 6.240047 10 11
52 GO:0043162 0.018333292 7.640534 6.240047 10 11
53 GO:0071711 0.018333292 7.640534 6.240047 10 11
54 GO:0099632 0.018333292 7.640534 6.240047 10 11
15 GO:0045540 0.006000840 5.740863 9.639007 15 17
18 GO:0048701 0.006038347 5.734381 9.643709 15 17
Term
51 negative regulation of interleukin-17 production
52 ubiquitin-dependent protein catabolic process via the multivesicular body sorting pathway
53 basement membrane organization
54 protein transport within plasma membrane
15 regulation of cholesterol biosynthetic process
18 embryonic cranial skeleton morphogenesis
After filtering a minimum count value, we can see that the top results are more related to our disease, or to general cancer processes than the previous one.
Several studies have proved that “IL-17” plays an important role in the pathophysiology of cancer, from tumorigenesis, proliferation, angiogenesis, and metastasis, to adapting the tumour in its ability to confer upon itself both immune, and chemotherapy resistance .
The second GO term represented in the above table that is also called “Ubiquitin-dependent protein catabolic process”" via the MVB pathway involve genes that are important on cancer tumorigenesis and/or progression such as “TSG101” . In previous studies in endometrial cancer patients, aberrant splicing of “TSG101”" gene appeared to be identified more frequently in cancerous than in non-cancerous tissues. Furthermore, functional proteomic analysis of genetically-defined human ovarian cancer models revealed that “TSG101”" is dysregulated in human ovarian epithelial cells expressing oncogenic “HRAS” or “KRAS” .
On the other hand, data from profiling of cancer tissues demonstrate critical contribution of cholesterol metabolism to cancer origin .
It is always interesting to see what genes are involved in the enriched pathways, and compare them to our top previously obtained DE genes. We can do that by accessing the “symbol” term of the GO object.
geneID <- geneIdsByCategory(hgOver)[goresults$GOBPID]
geneSYM <- sapply(geneID, function(id) select(org.Hs.eg.db, columns="SYMBOL", key=id, keytype="ENTREZID")$SYMBOL)
set1 <- geneSYM[1:6]
set1
$`GO:0032700`
[1] "ARG2" "IFNG" "IL12B" "TNFSF4" "IL27RA" "IL36RN" "FOXP3"
[8] "VSIR" "MIR26A1" "MIR26A2"
$`GO:0043162`
[1] "TSG101" "HDAC6" "TOM1L1" "VPS4A" "RNF115" "VPS28" "UBAP1" "RNF126"
[9] "MVB12B" "VPS37A"
$`GO:0071711`
[1] "CTSS" "DAG1" "GAS2" "HPN" "ITGB1" "SPINT2" "CLASP2" "CLASP1"
[9] "FERMT1" "PHLDB2"
$`GO:0099632`
[1] "HIP1" "RAB8A" "CPLX1" "GRIP1" "NSG1" "VPS35" "GRIPAP1"
[8] "GRIP2" "SNAP47" "GRASP"
$`GO:0045540`
[1] "ACACB" "CYP51A1" "DHCR7" "FDFT1" "FDPS" "HMGCR" "KPNB1"
[8] "RAN" "SP1" "SQLE" "SREBF1" "GGPS1" "ABCG1" "GPAM"
[15] "ELOVL6"
$`GO:0048701`
[1] "ALX3" "RUNX2" "GNAS" "SIX1" "TULP3" "TWIST1" "IFT140"
[8] "EIF4A3" "SIX2" "SLC39A1" "SETD2" "SLC39A3" "SIX4" "GRHL2"
[15] "RDH10"
We could notice that none of our top 5 DE genes (represented in Figure 27: Model 3 - Volcano plots of DE) are listed in the previous pathways, but there might be 2 reasons for that:
We only visualize the top 5 genes, but over 6000 were considered to find the pathways and they were selected per counts, so the more genes contributing, the more significant the pathway was. Which means it counted more the amount of genes involved in it, rather than the specific contribution of 1.
We only picked 5 genes in order to make the analysis simpler and more bearable when comparing 3 models, so probably if we had increased the number, some of the top genes would appear in the list
Nevertheless, all the pathways are significant in cancer pathogenesis and also there appear highly significant genes, such as the mentioned TSG101 and FOXP3 , so we consider that our approach worked fine.
As we made the distinction of those genes that are overexpressed and underexpressed in tumor samples, we can repeat the process to detect enriched sets taking into account this.
For overexpressed genes:
conditional(over_params) <- TRUE
over_hgOver <- hyperGTest(over_params)
over_goresults <- summary(over_hgOver)
head(over_goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0071480 3.103515e-05 10.746855 4.081329 12 15
2 GO:1902547 7.728550e-04 10.731554 2.720886 8 10
3 GO:0006066 1.168096e-03 1.641336 48.975942 68 180
4 GO:0035295 2.029701e-03 1.312613 157.267192 188 578
5 GO:2001034 2.299822e-03 9.386426 2.448797 7 9
6 GO:1902036 3.091555e-03 2.365998 12.788163 22 47
Term
1 cellular response to gamma radiation
2 regulation of cellular response to vascular endothelial growth factor stimulus
3 alcohol metabolic process
4 tube development
5 positive regulation of double-strand break repair via nonhomologous end joining
6 regulation of hematopoietic stem cell differentiation
over_goresults <- over_goresults[over_goresults$Size >= 10 & over_goresults$Size <= 300 & over_goresults$Count >= 10 , ]
over_goresults <- over_goresults[order(over_goresults$OddsRatio, decreasing=TRUE), ]
head(over_goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0071480 3.103515e-05 10.746855 4.081329 12 15
7 GO:0090630 3.643107e-03 3.579140 5.713860 12 21
29 GO:0090169 1.023404e-02 3.353299 4.897594 10 18
26 GO:0051293 7.863013e-03 3.279590 5.441771 11 20
17 GO:1905508 6.049304e-03 3.220755 5.985948 12 22
9 GO:0045824 4.316477e-03 2.877382 7.890568 15 29
Term
1 cellular response to gamma radiation
7 activation of GTPase activity
29 regulation of spindle assembly
26 establishment of spindle localization
17 protein localization to microtubule organizing center
9 negative regulation of innate immune response
For underexpressed genes:
conditional(under_params) <- TRUE
under_hgOver <- hyperGTest(under_params)
under_goresults <- summary(under_hgOver)
head(under_goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0006221 0.0009313583 4.194545 6.494145 14 22
2 GO:0090231 0.0011533687 16.753435 2.361507 7 8
3 GO:0090266 0.0011533687 16.753435 2.361507 7 8
4 GO:0015985 0.0012080726 4.791742 5.313392 12 18
5 GO:0071392 0.0014394037 5.987152 4.132638 10 14
6 GO:0034080 0.0017205027 3.135425 8.855653 17 30
Term
1 pyrimidine nucleotide biosynthetic process
2 regulation of spindle checkpoint
3 regulation of mitotic cell cycle spindle assembly checkpoint
4 energy coupled proton transport, down electrochemical gradient
5 cellular response to estradiol stimulus
6 CENP-A containing nucleosome assembly
under_goresults <- under_goresults[under_goresults$Size >= 10 & under_goresults$Size <= 300 & under_goresults$Count >= 10 , ]
under_goresults <- under_goresults[order(under_goresults$OddsRatio, decreasing=TRUE), ]
head(under_goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
5 GO:0071392 0.0014394037 5.987152 4.132638 10 14
4 GO:0015985 0.0012080726 4.791742 5.313392 12 18
12 GO:0034367 0.0031770737 4.788997 4.427826 10 15
10 GO:0045740 0.0027829946 4.390840 5.018203 11 17
11 GO:0048485 0.0027829946 4.390840 5.018203 11 17
1 GO:0006221 0.0009313583 4.194545 6.494145 14 22
Term
5 cellular response to estradiol stimulus
4 energy coupled proton transport, down electrochemical gradient
12 protein-containing complex remodeling
10 positive regulation of DNA replication
11 sympathetic nervous system development
1 pyrimidine nucleotide biosynthetic process
This differenciation of over and under expression allowed us to reafirm the results, as the pathways overexpressed are related to pathogenesis, and the underexpressed are related to damage control and cell cycle activities.
One example is “GO:0071392”: Previous studies have proved that estrogen plays an essential role in endometrial cancer cell proliferation .
A 23 paired data analysis enabled us to avoid batch effect and to have a balanced set being the main source of variation driven by the tumor and normal condition.
6540 genes have been detected as differentially expressed genes using the Adjust for unknown covarites method.
The functional enrichment analysis identified GO terms that are related to cancer tumorigenesis and progression.
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
[5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GO.db_3.8.2 org.Hs.eg.db_3.8.2
[3] GOstats_2.50.0 graph_1.62.0
[5] Category_2.50.0 Matrix_1.2-17
[7] sva_3.32.1 genefilter_1.66.0
[9] mgcv_1.8-28 nlme_3.1-139
[11] geneplotter_1.62.0 annotate_1.62.0
[13] XML_3.98-1.20 AnnotationDbi_1.46.0
[15] lattice_0.20-38 edgeR_3.26.4
[17] limma_3.40.2 SummarizedExperiment_1.14.0
[19] DelayedArray_0.10.0 BiocParallel_1.18.0
[21] matrixStats_0.54.0 Biobase_2.44.0
[23] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
[25] IRanges_2.18.1 S4Vectors_0.22.0
[27] BiocGenerics_0.30.0 knitr_1.23
[29] BiocStyle_2.12.0
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 xfun_0.7 splines_3.6.0
[4] htmltools_0.3.6 yaml_2.2.0 RBGL_1.60.0
[7] blob_1.1.1 survival_2.43-3 DBI_1.0.0
[10] Rgraphviz_2.28.0 bit64_0.9-7 RColorBrewer_1.1-2
[13] GenomeInfoDbData_1.2.1 stringr_1.4.0 zlibbioc_1.30.0
[16] codetools_0.2-16 evaluate_0.14 memoise_1.1.0
[19] GSEABase_1.46.0 Rcpp_1.0.1 xtable_1.8-4
[22] BiocManager_1.30.4 XVector_0.24.0 bit_1.1-14
[25] digest_0.6.19 stringi_1.4.3 bookdown_0.11
[28] grid_3.6.0 tools_3.6.0 bitops_1.0-6
[31] magrittr_1.5 RCurl_1.95-4.12 RSQLite_2.1.1
[34] pkgconfig_2.0.2 rmarkdown_1.13 AnnotationForge_1.26.0
[37] compiler_3.6.0
Ahn SH, Edwards AK, Singh SS, Young SL, Lessey BA, Tayade C. IL-17A Contributes to the Pathogenesis of Endometriosis by Triggering Proinflammatory Cytokines and Angiogenic Growth Factors. J Immunol. 2015;195(6):2591–2600. doi:10.4049/jimmunol.1501138
Chang JG, Su TH, Wei HJ, et al. Analysis of TSG101 tumour susceptibility gene transcripts in cervical and endometrial cancers. Br J Cancer. 1999;79(3-4):445–450. doi:10.1038/sj.bjc.6690069
Ghanbari Andarieh M, Agajani Delavar M, Moslemi D, Esmaeilzadeh S. Risk Factors for Endometrial Cancer: Results from a Hospital-Based Case-Control Study. Asian Pac J Cancer Prev. ;17(10):4791–4796. Published . doi:10.22034/apjcp.2016.17.10.4791
Gorin A, Gabitova L, Astsaturov I. Regulation of cholesterol biosynthesis and cancer signaling. Curr Opin Pharmacol. 2012;12(6):710–716. doi:10.1016/j.coph.2012.06.011
Leslie KK, Thiel KW, Goodheart MJ, De Geest K, Jia Y, Yang S. Endometrial cancer. Obstet Gynecol Clin North Am. 2012;39(2):255–268. doi:10.1016/j.ogc.2012.04.001
Svoronos N, Perales-Puchalt A, Allegrezza MJ, et al. Tumor Cell-Independent Estrogen Signaling Drives Disease Progression through Mobilization of Myeloid-Derived Suppressor Cells. Cancer Discov. 2017;7(1):72–85. doi:10.1158/2159-8290.CD-16-0502
Young T.W., Mei F.C., Rosen D.G., Yang G., Li N., Liu J., Cheng X. Up-regulation of tumor susceptibility gene 101 protein in ovarian carcinomas revealed by proteomics analyses. Mol. Cell. Proteomics. 2007;6:294–304. doi: 10.1074/mcp.M600305-MCP200.